Policy Gradients with Memory-Augmented Critic
Related Papers
Solving Deep Memory POMDPs with Recurrent Policy Gradients
This paper presents Recurrent Policy Gradients, a model-free reinforcement learning (RL) method creating limited-memory stochastic policies for partially observable Markov decision problems (POMDPs) that require long-term memories of past observations. The approach involves approximating a policy gradient for a Recurrent Neural Network (RNN) by backpropagating return-weighted characteristic eligibilities...
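To make the mechanism concrete, here is a minimal sketch of a recurrent policy gradient, written in PyTorch as a convenience (an assumption; the paper predates the library). An RNN policy maps the observation history to action logits, and the loss weights each step's log-likelihood by the return-to-go, so backpropagation through time carries credit back to the weights that processed earlier observations. All names and shapes are illustrative:

```python
# Hedged sketch: recurrent policy gradient (REINFORCE through time).
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq):
        # obs_seq: (1, T, obs_dim); the GRU state is the policy's memory
        h, _ = self.rnn(obs_seq)
        return self.head(h)            # (1, T, n_actions) action logits

def episode_loss(policy, obs_seq, actions, rewards, gamma=0.99):
    """Return-weighted log-likelihood loss for one episode.
    obs_seq: (T, obs_dim) float tensor; actions: (T,) long tensor;
    rewards: list of floats. Backprop through the GRU state credits
    late rewards to early observations."""
    logits = policy(obs_seq.unsqueeze(0)).squeeze(0)      # (T, n_actions)
    logp = torch.log_softmax(logits, dim=-1)
    logp_a = logp[torch.arange(len(actions)), actions]    # log pi(a_t | h_t)
    returns = torch.zeros(len(rewards))                   # discounted R_t
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return -(logp_a * returns).sum()                      # REINFORCE-style
```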
Off-Policy Actor-Critic
This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning....
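A hedged sketch of one per-time-step update in the spirit of this algorithm follows; it substitutes a plain importance-weighted TD(0) critic for the paper's gradient temporal-difference critic and omits eligibility traces, and all names and dimensions are illustrative. Note how every operation is O(d) in the number of weights:

```python
# Hedged sketch: one online off-policy actor-critic step, linear features.
import numpy as np

def offpac_step(u, v, x, a, r, x_next, b_prob,
                alpha_u=0.01, alpha_v=0.1, gamma=0.99):
    """u: actor weights (n_actions, d); v: critic weights (d,);
    x, x_next: state features; a: action taken by the behaviour policy;
    b_prob: behaviour policy's probability of a."""
    prefs = u @ x
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()                          # target policy pi(.|s), softmax
    rho = pi[a] / b_prob                    # importance-sampling ratio
    delta = r + gamma * v @ x_next - v @ x  # TD error under current critic
    v += alpha_v * rho * delta * x          # simplified TD(0) critic update
    grad_logp = -np.outer(pi, x)            # grad log pi for softmax-linear
    grad_logp[a] += x
    u += alpha_u * rho * delta * grad_logp  # importance-weighted actor step
    return u, v
```

The importance ratio rho reweights updates so the target policy can be learned from trajectories generated by a different behaviour policy.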
Classification-based Policy Iteration with a Critic
In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a ...
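The bias-variance control comes from bootstrapping the tail of each rollout with the critic: shorter rollouts cut Monte Carlo variance but lean more on the critic's bias. A minimal sketch, assuming a hypothetical simulator interface env.step(s, a) -> (s', r, done) and an illustrative horizon:

```python
# Hedged sketch: truncated-rollout action-value estimate with a critic.
def truncated_rollout_q(env, state, action, policy, critic_v,
                        horizon=10, gamma=0.99):
    """Estimate Q(state, action): roll out `horizon` steps under `policy`,
    then let the critic's value stand in for the rest of the return."""
    total, discount = 0.0, 1.0
    s, a = state, action
    for _ in range(horizon):
        s, r, done = env.step(s, a)      # hypothetical simulator interface
        total += discount * r
        if done:
            return total
        discount *= gamma
        a = policy(s)
    return total + discount * critic_v(s)  # critic bootstraps the tail
```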
Tunnel ventilation control via an actor-critic algorithm employing nonparametric policy gradients
The appropriate operation of a tunnel ventilation system provides drivers passing through the tunnel with comfortable and safe driving conditions. Tunnel ventilation involves maintaining the CO pollutant concentration and VI (visibility index) under adequate levels while operating highly energy-consuming facilities such as jet-fans. It is therefore important to have an efficient operating algor...
Linear Off-Policy Actor-Critic
This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off...
Journal
Title: Transactions of The Japanese Society for Artificial Intelligence
Year: 2021
ISSN: 1346-0714, 1346-8030
DOI: https://doi.org/10.1527/tjsai.36-1_b-k71